Abstract —Deep convolutional neural networks (CNNs) have shown
great success in image classification. CNNs mainly consist of
convolutional and pooling layers, both of which operate on local
image areas without considering the dependencies among different
image regions. However, such dependencies are very important for
generating explicit image representations. In contrast, recurrent
neural networks (RNNs) are well known for their ability to encode
contextual information in sequential data, and they require only a
limited number of network parameters. General RNNs, however, can
hardly be applied directly to non-sequential data. Thus, we propose
hierarchical RNNs (HRNNs). In HRNNs, each RNN layer focuses on
modeling spatial dependencies among image regions from the same
scale but different locations, while the cross-scale RNN
connections model scale dependencies among regions from the same
location but different scales. Specifically, we propose two
recurrent neural network models: 1) the hierarchical simple
recurrent network (HSRN), which is fast and has low computational
cost; and 2) the hierarchical long-short term memory recurrent
network (HLSTM), which performs better than HSRN at the price of
higher computational cost.

In this manuscript, we integrate CNNs with HRNNs and develop
end-to-end convolutional hierarchical recurrent neural networks
(C-HRNNs). C-HRNNs not only make use of the representation power
of CNNs, but also efficiently encode spatial and scale dependencies
among different image regions. On four of the most challenging
object/scene image classification benchmarks, our C-HRNNs achieve
state-of-the-art results on Places 205, SUN 397, and MIT indoor,
and competitive results on ILSVRC 2012.

Index Terms —Deep Learning, Image Classification, Recurrent 
Neural Network, Convolutional Neural Network. 


I. Introduction

Over the last few years, deep convolutional neural networks
(CNNs) have brought a revolution to the computer vision community
by learning powerful representations from large-scale datasets. So
far, CNNs have shown their success in areas including, but not
limited to, image classification, detection, and face recognition.

The key idea of CNNs is to use convolutional and pooling layers to
progressively extract more and more abstract patterns. The
convolutional layers convolve multiple local filters with the input
images (or the outputs of previous layers) and aim to produce
translation-invariant local features. Afterwards, pooling layers
summarize the feature responses of the convolutional layers over
multiple image regions and compress the size of the response maps.
Both convolution and pooling are performed locally. For example,
the representation of the top-left image region does not influence
the representation of the bottom-right region. However, contextual
information is very important for object/scene recognition. For
example, in an image with label “beach”, if “sand” regions are
represented with reference to “sea” regions, then it is much
easier to distinguish them from “road” or “desert sand”. In
CNNs, spatial and scale dependencies among different image
regions are not explicitly modeled.

In this manuscript, we aim to encode contextual dependencies in
the image representation. To learn the dependencies efficiently
and effectively, we propose a new class of hierarchical recurrent
neural networks (HRNNs) and use them to learn such contextual
information.

Recurrent neural networks (RNNs) have achieved great success in
natural language processing (NLP) [14]-[19]. RNNs [20], [21] are
neural networks developed for modeling dependencies in sequences
by using feedback connections among themselves. Thus, they can
retain all the processed states in the sequence and learn patterns
from sequential context. Furthermore, because the hidden layers
are reused, only a limited number of neurons need to be kept in
the model. The two most popular RNN models are the simple
recurrent neural network (SRN) and the long-short term memory
recurrent network (LSTM). Based on these, we introduce the
hierarchical SRN (HSRN) and the hierarchical LSTM (HLSTM). Both
HSRN and HLSTM model the spatial and scale dependencies among
different local image regions. However, they have different
characteristics: HSRN is simple and fast, while HLSTM is more
complex but is able to maintain long-term dependencies among
local image regions that are far away from each other, which
leads to better performance than HSRN.

Our proposed hierarchical recurrent neural networks 
(HRNNs) model two types of contextual dependencies: spatial 
dependencies and scale dependencies. 

Firstly, we consider the spatial dependencies among image
regions from the same scale but at different locations. Since
there are no off-the-shelf sequences in images, inspired by the
multi-dimensional RNN [22], we generate two-dimensional
spatial region sequences for images, and represent each region
as a function of its neighboring regions. Details will be
described in Section III-C1.

Secondly, we build multiple scale RNNs, and consider scale
dependencies among image regions from different scales but at
the same locations. Information captured at different scales
is complementary. Connecting multiple scales can help to learn
a more robust representation. For example, in an image with
label “car”, regions at a lower level scale mostly contain
patterns such as “tire” and “window”, while regions at a higher
level scale include global patterns such as “car”. Knowing the
existence of “car” can help the system to increase the
representation preference of “tire” in the corresponding local
regions. Details will be described in Section III-C2.

Fig. 1: The overall framework of C-HRNNs. CNN layers: Extract mid-level representations for image regions by processing
five convolutional and pooling layers. HRNN layers: (a) Pool the output of the fifth CNN layer into multiple scales. (b)
For each scale, spatial dependencies are captured by direct or indirect connections between each region and its surrounding
neighbors. (c) For different scales, scale dependencies are encoded by transferring information from the higher level scales to
corresponding areas at the lower level scales. Take the $l$-th ($l \in [1, \cdots, L]$) scale as an example: information of the yellow
block will be transferred to corresponding areas (highlighted in yellow) in the $(l+1)$-th to the $L$-th scales as reference, while
the red block in the $(l+1)$-th scale will be transferred to its influenced areas (highlighted in red) in the $(l+2)$-th to $L$-th scales.
FC layers: Collect different scale HRNN outputs and connect to two fully connected layers. (Best viewed in color)

However, HRNN layers operate on image regions, while in image
classification no intermediate labels are provided for any of
these regions. The only supervision is the image-level label. To
make use of it, fully connected layers are introduced to collect
the outputs of the HRNN layers, merge them through the global
hidden layer, and finally connect to the image-level labels with
a softmax layer.

Integrating CNNs with our HRNNs, we propose end-to-end networks
called convolutional hierarchical recurrent neural networks
(C-HRNNs). As shown in Figure 1, C-HRNNs not only maintain the
discriminative representation power of CNNs, but also efficiently
encode the spatial and scale contextual dependencies with HRNNs.
Tested on four of the most challenging large-scale image
classification benchmarks, C-HRNNs achieve state-of-the-art
results on Places 205, SUN 397, and MIT indoor, and promising
results on ILSVRC 2012.

II. Related Works

In the last few years, deep neural networks have made great
breakthroughs in computer vision. Many successful deep neural
nets with different structures have been proposed, such as
convolutional neural networks, deep belief nets [25]-[27], and
auto-encoders [28]-[31]. Among all these frameworks, CNNs are
the most developed networks for solving image classification
problems. The core idea of CNNs is to progressively learn more
abstract (higher visual level) and more complex patterns: the
first few layers focus on learning “Gabor-like” low-level local
features (e.g. edges and lines); based on these, the middle
layers learn parts of objects (e.g. “tires” and “windows” in
images with label “car”); the higher layers connect to the final
image-level labels and aim to learn representations of the whole
image.

In contrast, RNNs have achieved great success in natural
language processing (NLP) [14]-[19]. Different from CNNs, which
are composed purely of “feed-forward” network layers, RNNs [20],
[21] are “feed-back” neural networks designed for modeling
contextual dependencies. Because of the connections from the
previous states to the current ones, RNNs are networks with
“memory”. Through such “feed-back” connections, RNNs are able to
retain information about past inputs and to discover correlations
among input data that might be far away from each other in the
sequence.

Although very popular in NLP, RNNs have rarely been applied in
computer vision. In the recent decade, there have been mainly
five branches of work that involve the recurrent idea.

In the first branch, recurrent layers are mainly used as “tied”
layers in “feed-forward” networks, which means that different
layers share the same parameters. Different from our recurrent
networks, these “tied” layers iteratively encode the input data
from the same locations with the same network parameters, and
they focus on reducing the number of parameters rather than
modeling the contextual dependencies among input data from
different locations. In [34], shared CNNs are applied to learn
pixel label consistency among multi-scale image patches. In
DrSAE [35], an auto-encoder with rectified linear units is
employed to iteratively encode the global digit image. In [36],
the “tied” CNN (called a recurrent convolutional network) is
employed to assess the contributions of the number of layers,
response maps, and parameters. Different from these works, the
“recurrent” in our C-HRNNs means learning spatial and scale
dependencies among different image regions, and expanding the
receptive fields of local regions by encoding contextual
information.

In the second branch, RNNs are used to predict/generate the
motion curve of objects/parts in the current/next moment, and
are applied to visual attention tasks. In [37], [38], an RNN is
used to build a sequential variational auto-encoder that
iteratively analyzes/generates image parts (at each iteration,
the RNN selectively attends to parts of the image while ignoring
the others). In contrast, we aim to build end-to-end networks
for large-scale image classification.

In the third branch, RNNs [39], [40] are used to combine the
video information over an ordered sequence of video frames for
video recognition and description. In contrast, our C-HRNNs
model the contextual dependencies within a single image rather
than the sequential appearance/motion dependencies among
consecutive frames.

In the fourth branch, RNNs are combined with CNNs for
image/video description [41]-[44]. In these works, CNNs are used
to generate image/video features, while RNNs connect the
image/video feature domain to the text feature domain and mainly
focus on modeling the textual contextual dependencies in
sentences/paragraphs. Different from these works, our C-HRNNs
model the contextual information in the image appearance domain.

The last branch is the RNN pyramid [45], [46]. In these works,
multiple layers of local recurrent connectivity are stacked as
a pyramid to obtain different levels of visual abstraction. In
contrast, C-HRNNs model the scale dependencies among image
regions at the same level of visual abstraction but at different
pooling scales. Moreover, C-HRNNs integrate the discriminative
power of CNNs and the contextual modeling ability of RNNs, and
work efficiently and effectively for large-scale image
classification.

III. Convolutional Hierarchical Recurrent
Neural Networks

As shown in Figure 1, our proposed convolutional hierarchical
recurrent neural networks (C-HRNNs) consist of three types of
layers: 1) five convolutional (and pooling) layers for
extracting middle-level image region features; 2) hierarchical
recurrent layers for encoding spatial and scale dependencies
among different image regions; 3) two fully connected layers
for generating the global image representation. Finally, an
N-way (N indicates the number of categories) softmax loss layer
is added on top for classification.



[Figure 2, panels: (a) General SRN; (b) General LSTM]

Fig. 2: General SRN and LSTM structures, where the solid
arrows represent the forward transformations, and the dashed
arrows represent the recurrent connections from the previous
states to the current ones. (a) SRN structure. In each state $t$ of
the sequence, there are two inputs for the hidden layer $h^{(t)}$:
the current state input $x^{(t)}$ and the previous state hidden unit
$h^{(t-1)}$. The predicted output label $y^{(t)}$ depends on $h^{(t)}$. (b)
LSTM structure. Similar to SRN, but the long-term memory
can be kept by the introduced intermediate gates (input $i^{(t)}$,
forget $f^{(t)}$, output $o^{(t)}$, input modulation $g^{(t)}$ gates), and the
memory cell $c^{(t)}$.

A. Convolutional Layers 

As shown in the left part of Figure 1, given raw pixel images as
input, five convolutional layers are first applied to
progressively extract more and more complex and abstract
patterns. According to the analysis in [47], the outputs of the
fifth convolutional layer are able to capture patterns
representing parts and objects. Furthermore, the size of the
fifth-layer response maps is orders of magnitude smaller than
the size of the original raw pixel images. Thus, based on such
CNN features, our proposed HRNNs can model the contextual
dependencies among middle-level regions with semantic meanings,
and HRNNs can be processed very efficiently. Furthermore, with
back propagation, the RNNs can help the CNNs to increase the
quality of middle-level and low-level features.

Note that our HRNNs can easily be constructed on top of networks
other than CNNs (e.g. deep restricted Boltzmann machines,
auto-encoders [35]), on hand-crafted features (e.g. SIFT [48],
HOG [49]), or even from scratch. In this work, we choose CNNs
because of their excellent performance in representing mid-level
patterns, which is a prerequisite for good performance of the
subsequent HRNNs.

B. Review of General RNNs 

RNNs [20], [21] were originally developed for modeling
dependencies in time-sequential data. Two of the most typical
RNN models are the simple recurrent neural network (SRN) and the
long-short term memory recurrent neural network (LSTM). In the
following two subsections, SRN and LSTM are introduced to
represent each state $t$ of a given sequence of length $T$.
$x^{(t)}$, $h^{(t)}$ and $y^{(t)}$ are the input, hidden and output
representations of the $t$-th state respectively.

1) Simple Recurrent Neural Nets: As shown in Figure 2(a),
the $t$-th state in SRN can be represented as:

$$h^{(t)} = \psi_h\left(W_{hh} h^{(t-1)} + W_{xh} x^{(t)} + b_h\right) \tag{1}$$

$$y^{(t)} = \psi_y\left(W_{hy} h^{(t)} + b_y\right) \tag{2}$$

where $W_{xh}$, $W_{hh}$ and $W_{hy}$ are the shared transformation
matrices from input to hidden states, previous hidden to current
hidden states, and hidden to output states, $b_h$ and $b_y$ are bias
terms, and $\psi_h$ and $\psi_y$ are non-linear activation functions.
Since the expression of each state is based on the hidden
representation of the previous states, SRN can keep “memory” of
the whole sequence, and learn patterns based on such sequential
context.
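
Equations (1) and (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the text leaves $\psi_h$ and $\psi_y$ unspecified, so tanh and softmax are assumed here, the initial hidden state is assumed to be zero, and the helper names (`matvec`, `srn_forward`) are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return [a / total for a in e]

def srn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Apply Equations (1) and (2) to every state of the sequence xs."""
    h = [0.0] * len(b_h)  # assumption: all-zero initial hidden state
    hs, ys = [], []
    for x in xs:
        pre = [a + b + c
               for a, b, c in zip(matvec(W_hh, h), matvec(W_xh, x), b_h)]
        h = [math.tanh(a) for a in pre]      # psi_h assumed to be tanh
        out = [a + b for a, b in zip(matvec(W_hy, h), b_y)]
        hs.append(h)
        ys.append(softmax(out))              # psi_y assumed to be softmax
    return hs, ys
```

The same three matrices are reused at every state, which is why the number of parameters does not grow with the sequence length.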

Although simple and effective, SRN has the unpleasant “short
term memory” problem [50]: during the back-propagation procedure
in SRN, the gradients are multiplied $T$ times by $W_{hh}$.
Consequently, when $T$ is relatively large, gradient
vanishing/exploding problems occur.
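
The effect of the repeated multiplication can be seen in a deliberately simplified scalar case. The sketch below assumes a one-dimensional hidden state and ignores the derivative of the activation function, so it only illustrates the trend rather than the exact gradient of an SRN; `gradient_scale` is a hypothetical helper name.

```python
def gradient_scale(w_hh, T):
    """Factor picked up by a gradient after T multiplications by a scalar w_hh."""
    scale = 1.0
    for _ in range(T):
        scale *= w_hh
    return scale

# A weight slightly below 1 makes the factor shrink towards zero,
# a weight slightly above 1 makes it grow without bound.
for w in (0.9, 1.0, 1.1):
    print(w, gradient_scale(w, 50))
```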

2) Long-Short Term Memory Recurrent Neural Nets: To
overcome the above “short term memory” issue, LSTM [50]
introduces the “memory block” (composed of multiplication
gates and a memory cell) to keep the long-term flow of
sequential information. As shown in Figure 2(b), the $t$-th
state in LSTM can be represented as:

$$i^{(t)} = \sigma\left(W_{hi} h^{(t-1)} + W_{xi} x^{(t)} + b_i\right) \tag{3}$$

$$f^{(t)} = \sigma\left(W_{hf} h^{(t-1)} + W_{xf} x^{(t)} + b_f\right) \tag{4}$$

$$o^{(t)} = \sigma\left(W_{ho} h^{(t-1)} + W_{xo} x^{(t)} + b_o\right) \tag{5}$$

$$g^{(t)} = \phi\left(W_{hg} h^{(t-1)} + W_{xg} x^{(t)} + b_g\right) \tag{6}$$

$$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot g^{(t)} \tag{7}$$

$$h^{(t)} = o^{(t)} \odot \phi\left(c^{(t)}\right) \tag{8}$$

$$y^{(t)} = \psi_y\left(W_{hy} h^{(t)} + b_y\right) \tag{9}$$

In addition to the hidden state $h^{(t)}$, LSTM introduces a
memory cell $c^{(t)}$ and four multiplication gates: $i^{(t)}$, $f^{(t)}$,
$o^{(t)}$, and $g^{(t)}$, which are the input, forget, output, and input
modulation gates respectively. $\sigma$ is the logistic sigmoid
function (thus, $i^{(t)}$, $f^{(t)}$ and $o^{(t)}$ range over [0,1]), $\phi$ is the
hyperbolic tangent nonlinearity, and $\odot$ represents element-wise
multiplication. Specifically, the self-recurrent memory cell
$c^{(t)}$ keeps the long-term memory. The input gate $i^{(t)}$ controls
the flow of incoming signal that alters the state of $c^{(t)}$. The
forget gate $f^{(t)}$ helps $c^{(t)}$ to selectively maintain or forget
the previous state $c^{(t-1)}$, while the output gate $o^{(t)}$ controls
the amount of memory that is transmitted to $h^{(t)}$.

The “memory block” structure enables LSTM to selectively
forget its previous memory states and to learn long-term
dynamics which a general SRN can hardly handle. However, LSTM
has more intermediate neurons than SRN, and thus consumes
much more computational resources.
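
A single LSTM state update, following the standard gate equations of this subsection (sigmoid for the input, forget and output gates, hyperbolic tangent for the input modulation gate and for the cell output), can be sketched as follows. The layout of `params` and the function names are assumptions made for illustration only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(act, W_h, W_x, b, h_prev, x):
    """act(W_h h_prev + W_x x + b), applied element-wise."""
    pre = [p + q + r
           for p, q, r in zip(matvec(W_h, h_prev), matvec(W_x, x), b)]
    return [act(a) for a in pre]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM state update.

    params maps each gate name in "ifog" to a tuple (W_h, W_x, b);
    this container layout is a hypothetical choice for the sketch.
    """
    i = gate(sigmoid, *params["i"], h_prev, x)    # input gate
    f = gate(sigmoid, *params["f"], h_prev, x)    # forget gate
    o = gate(sigmoid, *params["o"], h_prev, x)    # output gate
    g = gate(math.tanh, *params["g"], h_prev, x)  # input modulation gate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Compared with the SRN update, each state evaluates four gate functions instead of one hidden transformation, which is where the extra computational cost comes from.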

In this manuscript, rather than modeling contextual correlations
among different states in time sequences, we modify RNNs to
model contextual dependencies among image region “2D sequences”.
Details of our proposed networks are introduced in the following
section.


C. Hierarchical Recurrent Layers 

In CNNs, convolution and pooling are performed locally on image
regions. The spatial dependencies among different regions from
the same scale are ignored, let alone the scale dependencies
among image regions from different scales. On the other hand,
general RNNs (Section III-B) are designed for modeling
dependencies in sequences, but they cannot be applied directly
to images. Thus, as shown in the middle part of Figure 1, we
propose hierarchical recurrent layers to model spatial and scale
contextual dependencies.

1) Modeling Spatial Contextual Dependencies: Spatial context
is an important clue for recognizing images. For example, in an
image with label “computer room”, knowing the existence of
“computer” can help the system to increase the preference of
representing “desk” in the surrounding image regions. In this
subsection, we introduce the spatial RNNs, which model spatial
contextual dependencies within single-scale image feature maps.

There are no existing sequences in images, hence we need to
generate region sequences in the image domain. Taking Alex-net
as an example, as described in Section III-A, we use the
fifth-layer CNN feature maps (256 x 6 x 6, corresponding to
number of channels x height x width) as the input of the
recurrent layers. They can be considered as a 6 x 6 2D data
array in which each element is represented as a 256-dimensional
vector. How can such a 2D array be converted into sequences?
The most straightforward way is to scan it row by row or column
by column. However, images are 2-dimensional data. For each
element, contextual information from all directions should be
taken into consideration. Thus, inspired by [22], we generate
“2D sequences” for images, and each element simultaneously
receives spatial contextual references from its 2D neighborhood
elements.

As shown in (b) of Figure 1, spatial contextual information
comes from all directions (left, right, top, bottom). If we
directly connected all the surrounding elements to the target,
each node would simultaneously be the “previous” and “next”
element of its neighbors, and the connections would form a
cyclic graph. The resulting network would be difficult to
optimize. Thus, four directional “2D sequences” are generated
for each scale: top-left to bottom-right, bottom-right to
top-left, bottom-left to top-right, and top-right to
bottom-left. Each of them focuses on transferring information
from an independent direction through an acyclic path. Take the
top-left to bottom-right sequence (as shown in (a) of Figure 3)
as an example: each element receives references from its nearest
neighbor elements in the previous row and the previous column.
All the elements are visited once, and each element can be
unrolled into a function of all the previously visited elements.
Similarly, contextual information from the other three
directions can be encoded by “2D sequences” as shown in (b-d) of
Figure 3.

Fig. 3: Overview of a single scale HRNN layer (6x6 image regions, one dashed box corresponds to one image region). (a):
Transmit information from left and top spatial neighbors for each region, and process from top-left corner to bottom-right
corner in an acyclic way. (b-d): Similar to (a), except the processing directions are bottom-right to top-left, bottom-left to
top-right, and top-right to bottom-left respectively. (e): Combine (a-d), then each image region has direct or indirect contextual
references from all the other regions. (Best viewed in color)

For each of the four directional “2D sequences”, the
transformation matrices are shared through the whole sequence.
To model the spatial correlations among different image regions,
general SRN and LSTM (Section III-B) are modified to model our
spatial sequences, and they are called spatial SRN and spatial
LSTM in the rest of this section.
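
The four directional “2D sequences” can be illustrated with a small scheduling sketch: for each direction it lists a visiting order and the two neighbors (previous row, previous column) an element depends on, and checks that every neighbor is visited first, i.e. that the path is acyclic. The row-by-row visiting order within a direction is an assumption of this sketch (any order that respects the dependencies would do), and all names are hypothetical.

```python
# (row step, column step) of each directional "2D sequence"
DIRECTIONS = {
    "top-left to bottom-right": (1, 1),
    "bottom-right to top-left": (-1, -1),
    "bottom-left to top-right": (-1, 1),
    "top-right to bottom-left": (1, -1),
}

def scan_order(rows, cols, dr, dc):
    """Positions in the order they are visited for one direction."""
    rs = range(rows) if dr > 0 else range(rows - 1, -1, -1)
    cs = range(cols) if dc > 0 else range(cols - 1, -1, -1)
    return [(r, c) for r in rs for c in cs]

def predecessors(r, c, rows, cols, dr, dc):
    """Neighbors in the previous row and previous column, if they exist."""
    candidates = [(r - dr, c), (r, c - dc)]
    return [(pr, pc) for pr, pc in candidates
            if 0 <= pr < rows and 0 <= pc < cols]

def is_acyclic_schedule(rows, cols, dr, dc):
    """True if every element is visited after all of its predecessors."""
    seen = set()
    for r, c in scan_order(rows, cols, dr, dc):
        if any(p not in seen for p in predecessors(r, c, rows, cols, dr, dc)):
            return False
        seen.add((r, c))
    return True
```

For a 6 x 6 grid, `all(is_acyclic_schedule(6, 6, dr, dc) for dr, dc in DIRECTIONS.values())` holds, and each direction visits all 36 regions exactly once.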


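Spatial SRN: Firstly, the spatial SRN is introduced. The hidden
representation of each image region in the “2D sequences” is:

$$h_{\searrow}^{(r,c)} = \psi_h\left(W_{hh}^{r} h_{\searrow}^{(r-1,c)} + W_{hh}^{c} h_{\searrow}^{(r,c-1)} + W_{xh} x^{(r,c)} + b_h\right) \tag{10}$$

$$h_{\nwarrow}^{(r,c)} = \psi_h\left(W_{hh}^{r} h_{\nwarrow}^{(r+1,c)} + W_{hh}^{c} h_{\nwarrow}^{(r,c+1)} + W_{xh} x^{(r,c)} + b_h\right) \tag{11}$$

$$h_{\nearrow}^{(r,c)} = \psi_h\left(W_{hh}^{r} h_{\nearrow}^{(r+1,c)} + W_{hh}^{c} h_{\nearrow}^{(r,c-1)} + W_{xh} x^{(r,c)} + b_h\right) \tag{12}$$

$$h_{\swarrow}^{(r,c)} = \psi_h\left(W_{hh}^{r} h_{\swarrow}^{(r-1,c)} + W_{hh}^{c} h_{\swarrow}^{(r,c+1)} + W_{xh} x^{(r,c)} + b_h\right) \tag{13}$$

$$h^{(r,c)} = h_{\searrow}^{(r,c)} + h_{\nwarrow}^{(r,c)} + h_{\nearrow}^{(r,c)} + h_{\swarrow}^{(r,c)} \tag{14}$$

where $(r, c)$ is the position of the element, and $x^{(r,c)}$ is the
input, which is an image region represented by a 256-dimensional
fifth CNN layer feature vector. $h_{\searrow}^{(r,c)}$, $h_{\nwarrow}^{(r,c)}$,
$h_{\nearrow}^{(r,c)}$ and $h_{\swarrow}^{(r,c)}$ denote the hidden representations of
$x^{(r,c)}$ in the four “2D sequences” respectively (corresponding
to the top-left to bottom-right, bottom-right to top-left,
bottom-left to top-right, and top-right to bottom-left
directions). $h^{(r,c)}$ is the combination of the four directional
hidden representations, which is the output of the spatial SRN.
For each direction, $W_{hh}^{r}$ and $W_{hh}^{c}$ are the row-based and
column-based hidden to hidden states transformation matrices,
$W_{xh}$ is the input to hidden states transformation matrix, $b_h$
is the bias term, and $\psi_h$ is a non-linear activation function
(ReLU is used here).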
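
A minimal sketch of a spatial SRN on a small grid is given below. It follows the description above: each element combines its neighbors from the previous row and the previous column of the current direction with its own input, using ReLU. Two points are assumptions of this sketch rather than statements of the text: out-of-grid neighbors are treated as zero vectors, and the four directional results are combined by element-wise summation. The parameter container and function names are hypothetical.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def directional_srn(x, W_r, W_c, W_x, b, dr, dc):
    """Hidden states of one directional 2D sequence over the grid x."""
    rows, cols = len(x), len(x[0])
    zero = [0.0] * len(b)  # assumption: missing neighbors contribute zeros
    h = [[None] * cols for _ in range(rows)]
    rs = range(rows) if dr > 0 else range(rows - 1, -1, -1)
    cs = range(cols) if dc > 0 else range(cols - 1, -1, -1)
    for r in rs:
        for c in cs:
            prev_row = h[r - dr][c] if 0 <= r - dr < rows else zero
            prev_col = h[r][c - dc] if 0 <= c - dc < cols else zero
            pre = [p + q + s + t for p, q, s, t in zip(
                matvec(W_r, prev_row), matvec(W_c, prev_col),
                matvec(W_x, x[r][c]), b)]
            h[r][c] = [max(0.0, a) for a in pre]  # ReLU
    return h

def spatial_srn(x, params):
    """Combine the four directions; params maps (dr, dc) -> (W_r, W_c, W_x, b)."""
    rows, cols = len(x), len(x[0])
    parts = [directional_srn(x, *p, dr, dc) for (dr, dc), p in params.items()]
    # assumption: the directional states are combined by summation
    return [[[sum(vals) for vals in zip(*(part[r][c] for part in parts))]
             for c in range(cols)] for r in range(rows)]
```

With all hidden to hidden matrices set to zero and $W_{xh}$ set to the identity, every direction reduces to ReLU of the input, which makes the role of the recurrent terms easy to isolate.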


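Spatial LSTM: Similar to Equation 14 in spatial SRN, each hidden
unit $h^{(r,c)}$ of spatial LSTM is also a combination of four
directional hidden representations. To make the expression
concise, only the functions corresponding to the $\searrow$
direction are expanded here (corresponding to Equation 10):

$$i_{\searrow}^{(r,c)} = \sigma\left(W_{hi}^{r} h_{\searrow}^{(r-1,c)} + W_{hi}^{c} h_{\searrow}^{(r,c-1)} + W_{xi} x^{(r,c)} + b_i\right) \tag{15}$$

$$f_{\searrow}^{(r,c)} = \sigma\left(W_{hf}^{r} h_{\searrow}^{(r-1,c)} + W_{hf}^{c} h_{\searrow}^{(r,c-1)} + W_{xf} x^{(r,c)} + b_f\right) \tag{16}$$

$$o_{\searrow}^{(r,c)} = \sigma\left(W_{ho}^{r} h_{\searrow}^{(r-1,c)} + W_{ho}^{c} h_{\searrow}^{(r,c-1)} + W_{xo} x^{(r,c)} + b_o\right) \tag{17}$$

$$g_{\searrow}^{(r,c)} = \phi\left(W_{hg}^{r} h_{\searrow}^{(r-1,c)} + W_{hg}^{c} h_{\searrow}^{(r,c-1)} + W_{xg} x^{(r,c)} + b_g\right) \tag{18}$$

$$c_{\searrow}^{(r,c)} = \ldots \tag{19}$$

(the right-hand side of Equation 19, the memory cell update, is
not recoverable from the source text)

$$h_{\searrow}^{(r,c)} = o_{\searrow}^{(r,c)} \odot \phi\left(c_{\searrow}^{(r,c)}\right) \tag{20}$$

where $(r, c)$ is the current state position, and $x^{(r,c)}$
represents the current input data. $i_{\searrow}^{(r,c)}$, $f_{\searrow}^{(r,c)}$,
$o_{\searrow}^{(r,c)}$ and $g_{\searrow}^{(r,c)}$ correspond to the input, forget,
output, and input modulation gates, $c_{\searrow}^{(r,c)}$ denotes the
memory cell unit, and finally $h_{\searrow}^{(r,c)}$ is the hidden
representation of $x^{(r,c)}$ in the top-left to bottom-right
direction. Similarly, the other three directions can be obtained.

For each gate function, $W_{hp}^{r}$, $W_{hp}^{c}$, $W_{xp}$ and $b_p$
($p \in \{i, f, o, g\}$) are the hidden-gate (row), hidden-gate
(column), and input-gate transformation matrices and the bias
term respectively. $\sigma$ and $\phi$ are non-linear activation
functions; in this manuscript, the sigmoid is used as $\sigma$ and
the hyperbolic tangent as $\phi$.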

2) Modeling Scale Contextual Dependencies: Besides spatial
contextual dependencies, there also exist scale contextual
dependencies among image regions from the same locations but at
different scales, which are another important clue for image
recognition. For example, again in an image with label “computer
room”, knowing the global pattern “computer room” can help the
system to increase the preference of representing patterns that
correspond to “computer” and “desk” at local level scales. In
this subsection, we focus on modeling scale dependencies.

The final goal of image classification is to achieve a good
image-level representation, which is based on well-performing
local image region representations. When describing a local
image region, the traditional way is to encode only its own
information. In contrast, if information from higher level scale
regions is given, then global information is encoded in the
local features, which leads to better local descriptions. Thus,
we build connections across regions from different scales.

For each element at each scale, its receptive field covers a
number of elements at the lower level scales. More intuitively,
as shown in the middle part of Figure 1, the areas highlighted
in yellow at scales $l+1$ and $l+2$ are covered by the receptive
field of the yellow element at scale $l$. Thus, global
information from the higher level scale $l$ is transferred to the
corresponding areas at the lower level scales $l+1$ and $l+2$.
For the element at position $(r_l, c_l)$ on scale $l$, the scale
dependencies from higher level scales can be encoded as:

$$s_l^{(r_l,c_l)} = \sum_{j=1}^{l-1} W_{jl}\, h_j^{(r_j,c_j)} \tag{21}$$

where $l \in [2, \cdots, L]$, and $L$ is the number of scales.
$(r_j, c_j)$ is the position at the higher level scale $j$.
$h_j^{(r_j,c_j)}$ is the scale contextual element (which already
combines the four directional spatial dependencies, refer to
Equation 14) from the higher level scale. $W_{jl}$ is the scale $j$
to scale $l$ transformation matrix.
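
The accumulation of scale context in Equation 21 can be sketched as follows. Scales are indexed from 0 here, with index 0 the highest level (coarsest) scale. The text does not spell out how a position on a fine grid is matched to the covering position on a coarser grid, so the proportional rule in `covering_cell` is an assumption of this sketch, as are all function and argument names.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def covering_cell(r, c, size, coarse_size):
    """Assumed mapping from a cell of a size x size grid to the cell of the
    coarse_size x coarse_size grid whose receptive field covers it."""
    return (r * coarse_size // size, c * coarse_size // size)

def scale_context(h, sizes, W, l, r, c, dim):
    """Sum over all higher level scales j < l of W[(j, l)] applied to the
    hidden element of scale j that covers position (r, c) of scale l.

    h[j][rj][cj] is the hidden vector at scale j, sizes[j] its grid side.
    """
    s = [0.0] * dim
    for j in range(l):
        rj, cj = covering_cell(r, c, sizes[l], sizes[j])
        contrib = matvec(W[(j, l)], h[j][rj][cj])
        s = [a + b for a, b in zip(s, contrib)]
    return s
```

For grids of side 1, 2, 3 and 6, the bottom-right cell (5, 5) of the 6 x 6 grid receives context from cell (0, 0) of the 1 x 1 grid, cell (1, 1) of the 2 x 2 grid and cell (2, 2) of the 3 x 3 grid under this mapping.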

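3) HRNNs with Spatial & Scale Dependencies: By inserting
Equation 21 into Equation 10 (spatial SRN) or Equations 15-20
(spatial LSTM), scale and spatial dependencies can be modeled
together in our hierarchical RNNs.

HSRN: Take the top-left to bottom-right directional HSRN as an
example (refer to Equation 10); the hidden representation of
each element is:

$$h_{\searrow}^{(r,c)} = \psi_h\left(W_{hh}^{r} h_{\searrow}^{(r-1,c)} + W_{hh}^{c} h_{\searrow}^{(r,c-1)} + W_{xh} x^{(r,c)} + s^{(r,c)} + b_h\right), \quad s^{(r,c)} = \sum_{j=1}^{l-1} W_{j}\, h_j^{(r_j,c_j)} \tag{22}$$

HLSTM: Similarly, for the HLSTM model (refer to Equations
15-20), the hidden representation of each gate function is:

$$p_{\searrow}^{(r,c)} = \sigma\left(W_{hp}^{r} h_{\searrow}^{(r-1,c)} + W_{hp}^{c} h_{\searrow}^{(r,c-1)} + W_{xp} x^{(r,c)} + s^{(r,c)} + b_p\right), \quad s^{(r,c)} = \sum_{j=1}^{l-1} W_{j}\, h_j^{(r_j,c_j)}, \quad p \in \{i, f, o, g\} \tag{23}$$

in which $\sigma$ is the sigmoid function when $p \in \{i, f, o\}$, and
$\sigma$ is the hyperbolic tangent when $p = g$.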


For both Equations 22 and 23, the scale index $l$ of each
variable is removed for convenience of expression. Similarly,
expressions for the other three directions can be obtained.
Afterwards, referring to Equation 14, by combining the revised
four directional hidden representations, the complete HRNN
(for both HSRN and HLSTM) hidden element expression is:

$$h^{(r,c)} = h_{\searrow}^{(r,c)} + h_{\nwarrow}^{(r,c)} + h_{\nearrow}^{(r,c)} + h_{\swarrow}^{(r,c)} \tag{24}$$

According to prior notes on RNN optimization, RNNs can be
simply and effectively optimized by back propagation through
time (BPTT). In BPTT, the recurrent nets are unfolded into
feed-forward deep networks, and then normal back-propagation
can be applied. Using the “weight sharing” setting in Caffe
[51], BPTT can be performed with shared RNN weights.


D. Fully Connected Layers 

Different from applications like image labeling, where the
label of each pixel or patch-level image region is given, in
image classification there are no intermediate labels except the
overall image-level label. Thus, Equation 2 (corresponding to
HSRN) or Equation 9 (corresponding to HLSTM) cannot be directly
applied. To connect the hierarchical recurrent layers
(Equation 24) to the image labels, fully connected layers are
introduced to merge the information learned by different scales
of HRNNs:

$$g = \psi_g\left(W_{hg} H + b_g\right) \tag{25}$$

$$y = \varphi_o\left(W_{go}\, g + b_o\right) \tag{26}$$

where $H = [h_1, \cdots, h_L]$ and $h_l = [h_l^{(1,1)}, \cdots, h_l^{(R_l,C_l)}]$.

$W_{hg}$ is the fully connected transformation matrix that
transforms the HRNN output $H$ to the global hidden layer $g$.
$H$ is the concatenation of the HRNN layer outputs $h_l$
($l \in [1, \cdots, L]$) at different scales. For each scale, $h_l$
is the concatenation of all its hidden element expressions
$h_l^{(r_l,c_l)}$ ($r_l \in [1, \cdots, R_l]$, $c_l \in [1, \cdots, C_l]$; $R_l$ and
$C_l$ are the numbers of rows and columns at scale $l$
respectively). $W_{go}$ is learned to connect $g$ with the class
label $y$, and $b_g$ and $b_o$ are the bias terms. $\psi_g$ is a
non-linear activation function (ReLU is used in this
manuscript), and $\varphi_o$ is the softmax function for
classification.
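
The concatenation and the two transformations above can be sketched as follows. The sketch assumes that the outputs of all scales are flattened region by region and concatenated in scale order; the default grid sizes and channel count in `fc_input_dim` mirror the four scales (6 x 6, 3 x 3, 2 x 2, 1 x 1) and the 256-dimensional region vectors used elsewhere in the paper, and all function names are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    total = sum(e)
    return [a / total for a in e]

def fc_input_dim(grid_sizes=(6, 3, 2, 1), channels=256):
    """Length of the concatenated vector H for square grids."""
    return sum(n * n for n in grid_sizes) * channels

def concat_scales(h_by_scale):
    """Flatten h_by_scale[l][r][c] (a vector per region) into one list H."""
    return [v for grid in h_by_scale for row in grid for cell in row for v in cell]

def classify(h_by_scale, W_hg, b_g, W_go, b_o):
    """Global hidden layer g (ReLU) followed by a softmax over classes."""
    H = concat_scales(h_by_scale)
    g = [max(0.0, a + b) for a, b in zip(matvec(W_hg, H), b_g)]
    return softmax([a + b for a, b in zip(matvec(W_go, g), b_o)])
```

Under these assumptions `fc_input_dim()` evaluates to (36 + 9 + 4 + 1) x 256 = 12800, which would be the input width of the first fully connected layer.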


IV. Experiments 

In this section, the detailed network settings of our
end-to-end C-HRNNs are first introduced. Next, C-HRNNs are
compared with other popular methods on four challenging
object/scene image classification benchmarks: ILSVRC 2012,
Places 205, SUN 397, and MIT indoor. Afterwards, the
effectiveness of the different modules of C-HRNNs is analyzed,
and C-HSRN and C-HLSTM are compared in detail.

A. Experimental Settings 

Following the default data preprocessing settings in Caffe
[51], all images are resized to 256 x 256 pixels and the pixel
mean is subtracted. For training images, 10 sub-crops of size
224 x 224 (1 center, 4 corners, and their horizontal flips) are
extracted. In the remaining part of this section, if not
specified, the results are the top-1 accuracy (or error rates)
tested with the center crop by using a single model.

Layer | Alex-net | SPP-net | C-HRNNs
------|----------|---------|--------
conv1 | 96x11x11, st. 4, pad 0, LRN, x2 pool, map 27x27 | 96x7x7, st. 2, pad 1, LRN, x2 pool, map 55x55 | 96x7x7, st. 2, pad 1, LRN, x2 pool, map 55x55
conv2 | 256x5x5, st. 1, pad 2, LRN, x2 pool, map 13x13 | 256x5x5, st. 2, pad 0, LRN, x2 pool, map 13x13 | 256x5x5, st. 2, pad 0, LRN, x2 pool, map 13x13
conv3 | 384x3x3, st. 1, pad 1, map 13x13 | 384x3x3, st. 1, pad 1, map 13x13 | 384x3x3, st. 1, pad 1, map 13x13
conv4 | 384x3x3, st. 1, pad 1, map 13x13 | 384x3x3, st. 1, pad 1, map 13x13 | 384x3x3, st. 1, pad 1, map 13x13
conv5 | 256x3x3, st. 1, pad 1, x2 pool, map 6x6 | 256x3x3, st. 1, pad 1, {x2, x4, x7, x13} pool, map {6x6, 3x3, 2x2, 1x1} | 256x3x3, st. 1, pad 1, {x2, x4, x7, x13} pool, map {6x6, 3x3, 2x2, 1x1}
hrnn6 | - | - | {36, 9, 4, 1}x256x1x1, st. 1, dropout 0.5, map {6x6, 3x3, 2x2, 1x1}
fc7   | 4096, dropout 0.5 | 4096, dropout 0.5 | 4096, dropout 0.5
fc8   | 4096, dropout 0.5 | 4096, dropout 0.5 | 4096, dropout 0.5

TABLE I: Network structures. C-HSRN and C-HLSTM share the same structure, which is indicated as C-HRNNs.

Table I gives the detailed layer structures of the baseline
deep nets (Alex-net, SPP-net) and C-HRNNs.

Compared with SPP-net, our C-HRNNs have the same first five
convolutional layers: 96 (7 x 7), 256 (5 x 5), 384 (3 x 3),
384 (3 x 3), and 256 (3 x 3) filters respectively. The strides
of the first two layers are 2, and the rest are 1. Each of the
first, second and fifth convolutional layers is followed by a
max pooling layer with a kernel size of 3 x 3 and a stride of 2.
Finally, the size of the output feature maps of the fifth CNN
layer is 256 x 6 x 6 (number of channels x height x width).
Similar to SPP-net, we pool the feature maps into four scales,
and obtain response maps with sizes of {6 x 6, 3 x 3, 2 x 2,
1 x 1}.
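
The map sizes quoted for the C-HRNN convolutional stack can be checked with a few lines of arithmetic. The sketch assumes a 224 x 224 input crop, the kernel/stride/pad values listed in Table I, 3 x 3 max pooling with stride 2 after the first, second and fifth layers, and Caffe's rounding conventions (convolution output sizes are rounded down, pooling output sizes are rounded up). The function names are hypothetical.

```python
import math

def conv_out(size, kernel, stride, pad):
    """Output side length of a convolution (rounded down, as in Caffe)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=3, stride=2):
    """Output side length of unpadded pooling (rounded up, as in Caffe)."""
    return math.ceil((size - kernel) / stride) + 1

def c_hrnn_map_sizes(size=224):
    """Side length of the response map after each of the five conv stages."""
    sizes = []
    size = pool_out(conv_out(size, 7, 2, 1))   # conv1 + pool
    sizes.append(size)
    size = pool_out(conv_out(size, 5, 2, 0))   # conv2 + pool
    sizes.append(size)
    size = conv_out(size, 3, 1, 1)             # conv3
    sizes.append(size)
    size = conv_out(size, 3, 1, 1)             # conv4
    sizes.append(size)
    size = pool_out(conv_out(size, 3, 1, 1))   # conv5 + pool
    sizes.append(size)
    return sizes
```

Under these assumptions the function returns the side lengths 55, 13, 13, 13 and 6, in line with the map sizes listed for C-HRNNs and the final 6 x 6 response maps.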

Different from all of these baseline networks, our C-HRNNs 
introduce hierarchical recurrent layers (hrnn6 in Table I). 
For the hierarchical recurrent layers, we use three spatial 
RNN layers at scales {6 x 6, 3 x 3, 2 x 2} and one global 
1 x 1 pooling layer, and build cross-scale connections among 
all four scales. The corresponding numbers of image regions 
at the four scales are 36, 9, 4, and 1 respectively, and each 
region is represented as a 256-dimensional feature vector 
(the number of channels in the fifth CNN layer). For each RNN 
layer, the transformation matrices of the four directional 
“2D sequences”, in both HSRN and HLSTM, have size 256 x 256. 
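To make the idea of four directional “2D sequences” over a grid of regions concrete, here is a minimal NumPy sketch of a simple recurrent scan over a 6 x 6 grid of 256-dimensional region features. It is an illustration under stated assumptions, not the paper's formulation: the ReLU activation, the use of the upper and left neighbours as recurrent inputs, the summation of the four directions, and all names (`scan_2d`, `four_direction_scan`, `w_in`, `w_up`, `w_left`) are hypothetical.

```python
import numpy as np

def scan_2d(x, w_in, w_up, w_left):
    """One top-left to bottom-right pass over an (H, W, C) grid of features."""
    H, W, C = x.shape
    h = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            pre = w_in @ x[i, j]               # input-to-hidden term
            if i > 0:
                pre += w_up @ h[i - 1, j]      # context from the region above
            if j > 0:
                pre += w_left @ h[i, j - 1]    # context from the region to the left
            h[i, j] = np.maximum(pre, 0.0)     # ReLU (assumed activation)
    return h

def flip(a, axes):
    """Flip along the given axes; an empty tuple leaves the array unchanged."""
    return np.flip(a, axis=axes) if axes else a

def four_direction_scan(x, weights):
    """Run one pass from each corner (by flipping the grid) and sum the results."""
    out = np.zeros(x.shape)
    for axes, (w_in, w_up, w_left) in zip([(), (0,), (1,), (0, 1)], weights):
        out += flip(scan_2d(flip(x, axes), w_in, w_up, w_left), axes)
    return out

rng = np.random.default_rng(0)
grid = rng.standard_normal((6, 6, 256))        # 36 regions, 256 channels each
weights = [tuple(0.01 * rng.standard_normal((256, 256)) for _ in range(3))
           for _ in range(4)]                  # three 256 x 256 matrices per direction
hidden = four_direction_scan(grid, weights)
```

Each direction uses three 256 x 256 matrices in this sketch, so one scale has 12 matrices in total.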

To show the performance gains of spatial and scale 
dependencies separately, we introduce intermediate networks 
called convolutional multi-scale recurrent neural networks 
(C-MRNNs), which only consider spatial dependencies at 
multiple scales and ignore the scale dependencies. 
Specifically, the convolutional multi-scale simple recurrent 
neural network (C-MSRN) and the convolutional multi-scale 
long-short term memory neural network (C-MLSTM) are tested in 
the experiments. Furthermore, when all the hidden-hidden 
weights Whh in C-MSRN are set to 0 and the input-hidden 
weights Wih are identity matrices, C-MSRN degenerates to 
SPP-net. 
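The degenerate case can be illustrated with a single recurrent step. This is a sketch under assumptions: the update form `relu(Wih x + Whh h_prev)` and the non-negativity of the pooled features (as after a ReLU and max pooling) are assumed, and `srn_step` is a hypothetical helper.

```python
import numpy as np

def srn_step(x, h_prev, w_ih, w_hh):
    """One simple recurrent update with a ReLU activation (assumed form)."""
    return np.maximum(w_ih @ x + w_hh @ h_prev, 0.0)

rng = np.random.default_rng(0)
pooled = np.maximum(rng.standard_normal(256), 0.0)  # non-negative pooled feature
h_prev = rng.standard_normal(256)                   # arbitrary neighbouring state

# Whh = 0 removes all context; Wih = identity passes the input through.
h = srn_step(pooled, h_prev, np.eye(256), np.zeros((256, 256)))
passes_through = np.allclose(h, pooled)
```

With these weights the recurrent layer returns the pooled features unchanged, so the network reduces to plain multi-scale pooling.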

Both fully connected layers have 4096 output units, and 
dropout with a rate of 0.5 is applied to each of them. 

The training batch size is 256. The learning rate starts from 
0.01 and is divided by 10 when the accuracy stops increasing, 
and the momentum is 0.9. All the experiments are run on Caffe 
with a single NVIDIA Tesla K40 GPU. 
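The learning-rate rule can be sketched as a small function. This is only an illustration: the plateau test (no improvement over the best validation accuracy so far, with no patience window) and the name `plateau_schedule` are assumptions, and the accuracy values are made up for the example.

```python
def plateau_schedule(val_accuracies, base_lr=0.01, factor=10.0):
    """Return the learning rate in effect after each evaluation."""
    lr = base_lr
    best = float("-inf")
    lrs = []
    for acc in val_accuracies:
        if acc <= best:
            lr /= factor   # accuracy stopped increasing: divide the rate by 10
        else:
            best = acc     # new best accuracy: keep the current rate
        lrs.append(lr)
    return lrs

# Hypothetical validation accuracies, one per evaluation.
lrs = plateau_schedule([0.40, 0.50, 0.50, 0.55, 0.54])
```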

B. Experimental Results 

1) Experimental Results on ILSVRC 2012: The ImageNet 
Large-Scale Visual Recognition Challenge (ILSVRC) dataset 
is one of the most challenging and popular large-scale 
object image classification datasets. ILSVRC 2012 contains 
1.2 million training images and 50,000 validation images (50 
per class), which belong to 1000 object categories. 

In the upper part of Table II, we compare C-HRNNs with 
SPP-net, which encodes spatial information by using spatial 
pyramid pooling. Based on their released model^, we can only 
achieve a 41.47% top-1 error rate with one testing view. 
Therefore, we further tune the model with the settings in 
Section IV-A and finally achieve 38.21% (reported 38.01%) 
for SPP-net. The performance gap might be caused by the 
different training settings (when preprocessing images, 
SPP-net keeps the original image aspect ratio, while the 
standard Caffe does not). C-HRNNs and SPP-net use the same 
convolutional and fully connected layer settings, except that 
SPP-net directly applies {6 x 6, 3 x 3, 2 x 2, 1 x 1} spatial 
pyramid pooling after the fifth convolutional layer, while our 
C-HRNNs model spatial dependencies with an RNN at each scale 
and model scale dependencies across different scales. 

Among the SRN models, C-MSRN reduces the Top 1 error rate 
by 1.31% compared with SPP-net, which indicates the benefit 
of modeling spatial dependencies. After integrating the scale 
dependencies, C-HSRN is 1.83% better than SPP-net. Thus, 
encoding both spatial and scale dependencies helps to 
generate better image representations. 

In the LSTM models, the performance improvement introduced 
by modeling spatial and scale dependencies can also be 
observed. Different from SRN models, LSTM models are able 
to keep a longer-term memory of image region “2D sequences”. 
Compared with SPP-net, C-HLSTM is 2.36% better. C-MLSTM 
gets 36.01%, which is better than C-MSRN, and C-HLSTM 
(35.85%) also works better than C-HSRN. However, the 
performance gap between C-HLSTM and C-HSRN (0.53%) is 
smaller than the one between C-MLSTM and C-MSRN (0.89%). 
A likely reason is that the scale dependencies introduced 
from higher scales indirectly extend the long-term memory 
ability of C-MSRN and narrow the gap between SRN and LSTM 
models. 
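The gaps quoted above follow directly from the single-view Top 1 error rates in Table II. A short check (the helper `gap` is a hypothetical name used only here):

```python
# Single-view Top 1 error rates (%) from the upper part of Table II.
top1_error = {
    "SPP-net": 38.21,
    "C-MSRN": 36.90,
    "C-HSRN": 36.38,
    "C-MLSTM": 36.01,
    "C-HLSTM": 35.85,
}

def gap(worse, better):
    """Error-rate reduction (percentage points) of `better` over `worse`."""
    return round(top1_error[worse] - top1_error[better], 2)

spatial_gain_srn = gap("SPP-net", "C-MSRN")   # spatial dependencies only
full_gain_srn = gap("SPP-net", "C-HSRN")      # spatial + scale dependencies
full_gain_lstm = gap("SPP-net", "C-HLSTM")
lstm_vs_srn_hier = gap("C-HSRN", "C-HLSTM")
lstm_vs_srn_multi = gap("C-MSRN", "C-MLSTM")
```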

^ https://github.com/ShaoqingRen/SPP_net 
















Methods | test scales | test views | Top 1 val | Top 5 val 
MOP-CNN (max pooling) | 3 | 101 | 44.12% | - 
MOP-CNN (VLAD pooling) | 3 | 101 | 42.07% | - 
SPP-net | 1 | 1 | 38.21% | - 
C-MSRN | 1 | 1 | 36.90% | - 
C-HSRN | 1 | 1 | 36.38% | - 
C-MLSTM | 1 | 1 | 36.01% | - 
C-HLSTM | 1 | 1 | 35.85% | - 
---------------------------------------------------------- 
Alex-net | 1 | 10 | 40.7% | 18.2% 
ZF-net | 1 | 10 | 38.4% | 16.5% 
Overfeat | 1 | 10 | 35.6% | 14.7% 
SPP-net | 1 | 10 | 36.2% | 14.9% 
C-MSRN | 1 | 10 | 35.2% | 14.0% 
C-HSRN | 1 | 10 | 34.8% | 13.7% 
C-MLSTM | 1 | 10 | 34.5% | 13.5% 
C-HLSTM | 1 | 10 | 34.3% | 13.4% 

TABLE II: Comparison of error rates on the ILSVRC 2012 validation set. 


Methods | Top 1 val | Top 5 val 
Alex-net [55] | 50.06% | 80.51% 
SPP-net | 51.57% | 81.88% 
C-MSRN | 52.70% | 82.75% 
C-HSRN | 53.16% | 83.07% 
C-MLSTM | 53.75% | 83.36% 
C-HLSTM | 53.91% | 83.48% 


TABLE III: Comparison of accuracy on Places 205. 


We also compare with MOP-CNN, another CNN method based on 
spatial statistics, which directly uses the Caffe CNN to 
densely extract features from image patches at three scales 
and uses VLAD pooling to generate global representations. The 
performance gap indicates that our way of encoding spatial and 
scale information is more effective. 

In the lower part of Table II, C-HRNNs are compared with 
other deep neural networks under the most common setting: 10 
testing views, reporting Top 1 and Top 5 error rates. The 
strong performance of our C-HRNNs indicates that, besides 
going deeper and wider, RNNs are another promising way to 
increase the image representation power of neural networks. 

2) Experimental Results on Places 205: The Places 205 dataset 
[55] is currently the largest scene categorization dataset, and 
was released at the end of 2014. Different from ILSVRC 2012, 
it focuses on scene images rather than object-centric ones. It 
has 2.5 million training images from 205 scene categories, 
which is twice the size of ILSVRC 2012, and is much more 
challenging. There are 20,500 images (100 per category) in the 
validation set. 

As shown in Table III, C-HLSTM updates the state of the art 
on Places 205 (the previous best result was 50.06%, achieved 
by Alex-net) with an accuracy of 53.91%. When only spatial 
dependencies are introduced, C-MSRN and C-MLSTM outperform 
SPP-net by 1.13% and 2.18% respectively. After further 
integrating scale dependencies, C-HSRN and C-HLSTM improve 
over SPP-net by 1.59% and 2.34% respectively. 

3) Experimental Results on SUN 397: SUN 397 is another 
popular large-scale scene image recognition benchmark. There 
are 100,000 images from 397 scene classes in total. The 
standard splits are used here, in which there are 50 images 
per class for training and 50 images per class for testing. 
Since the number of training images is 


Methods | Accuracy 
MOP-CNN (max pooling) | 48.50% 
MOP-CNN (VLAD pooling) | 51.98% 
Alex-net (ILSVRC ft) | 44.42% 
Alex-net (Places ft) | 54.55% 
SPP-net (ILSVRC ft) | 49.02% 
C-MSRN (ILSVRC ft) | 51.76% 
C-HSRN (ILSVRC ft) | 52.59% 
C-MLSTM (ILSVRC ft) | 52.67% 
C-HLSTM (ILSVRC ft) | 52.78% 
SPP-net (Places ft) | 57.23% 
C-MSRN (Places ft) | 59.32% 
C-HSRN (Places ft) | 59.90% 
C-MLSTM (Places ft) | 60.08% 
C-HLSTM (Places ft) | 60.34% 
-------------------------------- 
Xiao et al. | 38.00% 
IFV | 47.20% 
MTL-SDCA | 49.50% 


TABLE IV: Comparison of accuracy on SUN 397.^ 


too small (20,000), we start from the models pre-trained on 
ILSVRC 2012 and Places 205, and use the training images 
from SUN 397 to fine-tune the networks. We also increase the 
learning rates of the HRNN layers (10 times higher than those 
of the other layers), so that fine-tuning focuses more on the 
spatial dependencies specific to SUN 397. 
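The layer-wise learning rates can be pictured as a per-layer multiplier on a shared base rate (in Caffe, a per-parameter `lr_mult` setting plays this role). The sketch below is illustrative only: the base rate of 0.001, the layer names other than conv5 and hrnn6, and the function `effective_learning_rates` are assumptions.

```python
def effective_learning_rates(base_lr, layer_names, boosted_prefix="hrnn", boost=10.0):
    """Map each layer to its learning rate; layers with the prefix get `boost` x base."""
    return {
        name: base_lr * (boost if name.startswith(boosted_prefix) else 1.0)
        for name in layer_names
    }

# Hypothetical layer names; only conv5 and hrnn6 are named in the text.
layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "hrnn6", "fc7", "fc8"]
rates = effective_learning_rates(0.001, layers)  # base rate is an assumed value
```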

As shown in the upper part of Table IV, our C-HRNNs 
perform better than existing CNNs. After fine-tuning on 
SUN 397, C-HRNNs are able to learn more data-adaptive 
spatial dependencies and significantly outperform SPP-net: 1) 
based on models pre-trained on ILSVRC, the performance 
gains of C-HSRN and C-HLSTM are 3.57% and 3.76% 
respectively; 2) based on models pre-trained on Places, the 
accuracy improvements of C-HSRN and C-HLSTM are 2.67% 
and 3.11% respectively. 

When fine-tuned from the Places 205 model, C-HLSTM 
achieves the state of the art on SUN 397 with an accuracy 
of 60.34%, which outperforms the previous best result 
(MOP-CNN, 51.98%) by 8.36%. Another observation is that the 
models fine-tuned from Places 205 consistently perform 
better than the ones fine-tuned from ILSVRC 2012. A likely 
reason is that both Places 205 and SUN 397 are scene 
datasets, so their domain gap is smaller than the gap 
between ILSVRC 2012 (an object dataset) and SUN 397. 

Methods | Accuracy 
MOP-CNN (max pooling) | 64.85% 
MOP-CNN (VLAD pooling) | 68.88% 
Alex-net (ILSVRC ft) | 61.57% 
Alex-net (Places ft) | 68.24% 
SPP-net (ILSVRC ft) | 66.32% 
C-MSRN (ILSVRC ft) | 68.28% 
C-HSRN (ILSVRC ft) | 68.88% 
C-MLSTM (ILSVRC ft) | 69.18% 
C-HLSTM (ILSVRC ft) | 69.25% 
SPP-net (Places ft) | 72.09% 
C-MSRN (Places ft) | 74.18% 
C-HSRN (Places ft) | 74.85% 
C-MLSTM (Places ft) | 75.30% 
C-HLSTM (Places ft) | 75.67% 
-------------------------------- 
Object Bank [58] | 37.60% 
Visual Concepts | 46.40% 
MMDL [60] | 50.15% 
IFV | 60.77% 
MLrep + IFV | 66.87% 
ISPR + IFV | 68.50% 


TABLE V: Comparison of accuracy on MIT indoor.^ 


The lower part of Table IV shows the traditional 
state-of-the-art shallow methods. Most of these works heavily 
depend on combining multiple densely extracted hand-crafted 
features, and the image-level representations are usually very 
high-dimensional. Another drawback of these methods is that 
the testing procedures are generally very time consuming, 
since the feature extraction steps are slow. Compared with 
them, our C-HRNNs perform much better with much lower 
computational cost in testing and much lower-dimensional 
features. 

4) Experimental Results on MIT Indoor: MIT indoor [54] 
is a very challenging scene image classification benchmark. 
This dataset focuses on indoor scenes, which usually contain 
many objects and have larger variations. There are 67 indoor 
scene categories in total, and the widely used split provided 
by [54] is applied in our experiments. In each class, around 
80 training images and around 20 testing images are selected. 
Because of the limited dataset size, we again start from the 
models pre-trained on ILSVRC 2012 and Places 205 and 
fine-tune them. The same fine-tuning is applied to the other 
baseline deep neural networks. 

As shown in the upper part of Table V, C-HRNNs 
outperform the other deep neural nets by clear margins. 
Compared with the state-of-the-art MOP-CNN, our C-HLSTM 
achieves an accuracy of 75.67%, which is 6.79% better. 
Compared with SPP-net: 1) based on models pre-trained on 
ILSVRC, the performance gains of C-HSRN and C-HLSTM are 
2.56% and 2.93% respectively; 2) based on models pre-trained 
on Places, the accuracy improvements of C-HSRN and C-HLSTM 
are 2.76% and 3.58% respectively. 

In the lower part of Table V, results of the state-of-the-art 
traditional shallow methods are given. Although very powerful 
on MIT indoor, these methods require much more computation 
to perform mid-level patch searching and clustering, their 
feature dimensions are relatively high, and most of them 
can hardly be applied to large-scale benchmarks. In contrast, 
our C-HRNNs are end-to-end feature learning frameworks 

^ILSVRC ft and Places ft represent the models fine-tuned based on the 
models pre-trained on ILSVRC 2012 and Places 205 respectively. 


with 4096-dimensional output features, and C-HRNNs can 
easily handle large-scale data. 


C. Analysis of C-HRNNs 

In this subsection, we will analyze the effectiveness of our 
C-HRNNs from different perspectives. 

1) C-HRNNs Visualization: Firstly, patterns learned by the 
hrnn6 layer (refer to Table I) of C-HLSTM are visualized 
in Figure 4. On the left part of Figure 4, six testing image 
regions at the 3 x 3 scale (refer to Section IV-A) are given, 
and the receptive field of each region is highlighted with a 
blue box in the original image. On the right part of Figure 
4, the top 8 nearest neighbors of each testing image region 
are shown. These nearest neighbors are searched among all 
the local region features extracted from training images and 
ranked by feature distance. For every two rows, the top row 
shows the nearest neighbors found with C-HLSTM hrnn6 layer 
features, and the bottom row shows the results of using 
SPP-net conv5 layer features. 
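The nearest-neighbor search can be sketched as a brute-force lookup over a bank of region features. The Euclidean distance, the function name `top_k_neighbors`, and the random stand-in features are assumptions made for this illustration.

```python
import numpy as np

def top_k_neighbors(query, bank, k=8):
    """Indices and distances of the k bank rows closest to `query` (Euclidean)."""
    dists = np.linalg.norm(bank - query, axis=1)  # distance to every stored region
    order = np.argsort(dists)[:k]                 # k smallest distances first
    return order, dists[order]

rng = np.random.default_rng(0)
bank = rng.standard_normal((1000, 256))  # stand-in for training-region features
order, dists = top_k_neighbors(bank[5], bank)
```

Running the same search once with hrnn6 features and once with conv5 features would give the two rows compared for each testing region.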

Comparing the visualization results of our C-HLSTM with 
those of SPP-net, we can observe clearly better local image 
region representations. Take the first testing image region 
as an example: it is the bottom-right area of the “radiator 
grille” image. With our C-HLSTM, this region is more likely 
to be matched to the “radiator grille” of the same or 
different car models, and less likely to be mismatched to 
similar patterns from unrelated classes. Take the last 
testing image, “partridge”, as another example. The testing 
region contains the “body of partridge” and the background 
“gravel”. Because our C-HLSTM takes contextual information 
into consideration, this region is represented as the “body 
of partridge”. In contrast, SPP-net wrongly focuses on the 
“gravel” and misses the target object. Similarly, better 
local region visualization results of C-HLSTM can be observed 
in other classes, such as man-made structures like “steel 
arch bridge” and creatures like “sea urchin”. 

2) C-HRNNs vs Modified CNNs: Since C-HRNNs have 
more parameters than the original CNNs, we aim to show 
quantitatively whether the performance gain comes from 
encoding contextual dependencies or simply from the increased 
number of parameters. 

For each HRNN scale, there are four directional “2D 
sequences”. In HSRN, each direction has three transformation 
matrices, while in HLSTM each direction has four gate 
functions, each with three transformation matrices. Thus, 
for each scale, there are 12 transformation matrices in HSRN 
and 48 matrices in HLSTM. Furthermore, in both HSRN and 
HLSTM, there are 6 cross-scale transformation matrices Wji. 
Each of these matrices has size 256 x 256, which is the same 
number of parameters as one convolution layer with 256 
kernels of size 1 x 1 x 256. Thus, in the modified CNNs, each 
transformation matrix in HRNN is replaced with a convolution 
layer. 
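The equivalence between a 256 x 256 transformation matrix and a convolution layer with 256 kernels of size 1 x 1 x 256 can be checked numerically. The sketch below uses random values and an explicit loop for the convolution; it illustrates the parameter-count argument only and is not the modified CNN itself.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))             # one transformation matrix
feature_map = rng.standard_normal((256, 6, 6))  # channels x height x width

# Apply the matrix to the 256-dim feature vector of every region.
per_region = np.einsum("oc,chw->ohw", W, feature_map)

# The same weights viewed as 256 kernels of size 256 x 1 x 1.
kernels = W[:, :, None, None]
conv_out = np.zeros((256, 6, 6))
for y in range(6):
    for x in range(6):
        patch = feature_map[:, y:y + 1, x:x + 1]  # 256 x 1 x 1 window
        conv_out[:, y, x] = np.sum(kernels * patch[None], axis=(1, 2, 3))

same_output = np.allclose(per_region, conv_out)
same_param_count = (W.size == kernels.size == 256 * 256)
```

What the 1 x 1 convolution lacks is the recurrent connection between neighbouring regions, which is the difference the comparison is designed to isolate.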

Testing on ILSVRC 2012, HSRN gets 36.38% in error rate, 
while the modified CNN gets 37.73%. Similarly, HLSTM gets 
35.85% in error rate, while the modified CNN gets 37.49%. 
These obvious gaps indicate that RNNs are able to learn 
contextual dependencies which cannot be captured by CNNs. 

Fig. 4: Visualization of the hrnn6 (C-HLSTM)/conv5 (SPP-net) layer features. The colored rectangles mark the receptive fields 
of each image region. First column on the left: testing image regions (blue rectangles) at scale 3x3. For each testing region, 
the regions on the right are the top 8 nearest neighbors searched by using C-HLSTM hrnn6 features (top row), and SPP-net 
conv5 features (bottom row). Green rectangles are the nearest neighbors with the same labels as the testing image region, red 
rectangles are the ones with the wrong labels. (Best viewed in color) 

3) Effect of Number of Spatial Context Directions: In 
C-HRNNs, four directional “2D sequences” are employed. Do 
they really learn information complementary to each other? 

In Table VI, the performance of C-HRNNs with different 
directions of spatial context is given. The first four rows 
show the performance of using a single directional “2D 
sequence”; the different directions perform similarly to 
each other. The last three rows of Table VI give the results 
of combining two directions and of using all four directions. 
Using two directions improves over using a single direction, 
and combining all four directions achieves the best 
performance. 


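Methods | Error | Methods | Error 
C-HSRN_1 (single direction) | 36.96% | C-HLSTM_1 (single direction) | 36.39% 
C-HSRN_1 (single direction) | 37.03% | C-HLSTM_1 (single direction) | 36.46% 
C-HSRN_1 (single direction) | 36.95% | C-HLSTM_1 (single direction) | 36.45% 
C-HSRN_1 (single direction) | 36.89% | C-HLSTM_1 (single direction) | 36.50% 
C-HSRN_2 (two directions) | 36.85% | C-HLSTM_2 (two directions) | 35.97% 
C-HSRN_2 (two directions) | 36.59% | C-HLSTM_2 (two directions) | 36.03% 
C-HSRN_4 | 36.38% | C-HLSTM_4 | 35.85% 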


TABLE VI: Error rates of applying C-HRNNs with different 
spatial contextual directions on ILSVRC 2012. C-HSRN_1 and 
C-HLSTM_1 use one-directional HRNNs; C-HSRN_2 and 
C-HLSTM_2 utilize two directions; C-HSRN_4 and C-HLSTM_4 
use all four directions. 

4) HRNNs Complexity: Although very powerful, our 
HRNNs do not bring much extra computational burden or 
memory usage. 

There are three HRNN layers that encode spatial 
dependencies, at scales 6 x 6, 3 x 3, and 2 x 2. Each of them 
has 12 transformation matrices in HSRN and 48 in HLSTM, and 
there are 6 hierarchical connections (3 from 1 x 1, 2 from 
2 x 2, and 1 from 3 x 3). Thus, there are 42 transformation 
matrices in HSRN and 150 in HLSTM in total, each of size 
256 x 256, so the HSRN layers have 2,752,512 parameters and 
the HLSTM layers have 9,830,400 parameters. HLSTM has almost 
four times as many parameters as HSRN, which enables the 
HLSTM models to learn more complex patterns at the price of 
more computation. In contrast, the fully connected layers 
hold most of the parameters of a CNN: e.g. the second fully 
connected layer alone needs to learn a 4096 x 4096 weight 
matrix with 16,777,216 parameters, far more than our HRNN 
layers. 
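The parameter counts above can be reproduced with a few lines of arithmetic. The helper name `hrnn_matrix_count` is hypothetical; the numbers are those stated in the text.

```python
def hrnn_matrix_count(per_scale, spatial_scales=3, cross_scale=6):
    """Total transformation matrices: per-scale matrices plus cross-scale links."""
    return spatial_scales * per_scale + cross_scale

matrix_size = 256 * 256                      # parameters per transformation matrix

hsrn_matrices = hrnn_matrix_count(12)        # 3 * 12 + 6
hlstm_matrices = hrnn_matrix_count(48)       # 3 * 48 + 6
hsrn_params = hsrn_matrices * matrix_size
hlstm_params = hlstm_matrices * matrix_size
fc_params = 4096 * 4096                      # second fully connected layer
ratio = hlstm_params / hsrn_params           # about 3.57
```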

In terms of memory consumption, HRNN layers do not cost 
much extra memory beyond some intermediate hidden layer 
outputs, i.e. the hidden states and the gate units (the 
latter only exist in HLSTM) for each image region, which are 
256-dimensional vectors. In a CNN, by contrast, the most 
memory-consuming part is the first convolutional layer. In 
our setting, the output of the first CNN layer has 1,161,600 
dimensions, compared with which the memory HRNNs need to 
store intermediate data is negligible. 
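A back-of-the-envelope comparison of these sizes is sketched below. The 110 x 110 spatial size of the first convolutional output is inferred from a 224 x 224 input with a 7 x 7 kernel, stride 2 and pad 1 (Table I), and counting one 256-dimensional hidden vector per region and direction is an assumption about what is stored.

```python
def conv_output_size(size, kernel, stride, pad):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

conv1_side = conv_output_size(224, kernel=7, stride=2, pad=1)
conv1_values = 96 * conv1_side * conv1_side   # 96 channels before pooling

regions = 36 + 9 + 4 + 1                      # regions over the four scales
directions = 4
hrnn_hidden_values = regions * directions * 256  # one hidden vector each (assumed)

fraction = hrnn_hidden_values / conv1_values  # roughly 4% of the conv1 output
```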

5) C-HRNNs Success & Failure Cases: In Figure 5, the 
final classification results of C-HLSTM and SPP-net on 
SUN 397 (fine-tuned from ILSVRC 2012 models) are visually 
compared. We show the images on which C-HLSTM leads to the 
highest accuracy improvement (the two rows above the dashed 
line), and the images on which C-HLSTM leads to the highest 
accuracy drop (the row below the dashed line). 

From the first two rows of Figure 5, we can clearly observe 
that SPP-net focuses on predicting image regions while 
ignoring the contextual information. For example, the label 
of the first image in the first row is “landing deck”. Our 
C-HLSTM recognizes it correctly, while SPP-net wrongly 
recognizes it as “windmill” with a very high confidence 
score, because it mistakes the rotor blades of the 
helicopter for windmill blades. In contrast, our C-HLSTM 
takes context such as the body of the helicopter and the deck 
into consideration. Thus, our C-HLSTM works better when the 
local image regions are confusing but contextual information 
can help to make better decisions. 

In the third row of Figure 5, we observe some interesting 
results. For example, the first image of the third row is a 
“rope bridge”, which is relatively small in the image, while 
the forest is more prominent. Thus, our C-HLSTM wrongly 
recognizes it as “rainforest”. For the third image of the 
third row, the first word that comes to mind is “cliff”, 
while the ground truth label “light house” covers just a 
small region on the “cliff”. Thus, our C-HLSTM makes mistakes 
when class labels are based on local regions rather than the 
global image. 

V. Conclusions 

In this manuscript, we propose an end-to-end deep learning 
framework, called C-HRNNs, to encode spatial and scale 
contextual dependencies in image representation. In 
C-HRNNs, CNN layers are first utilized to extract mid-level 
representations for local image regions. Based on the 
CNN layer outputs, our proposed hierarchical recurrent layers 
are then applied to model the spatial dependencies among 
different image regions at the same scale, and the scale 
dependencies among image regions from different scales but 
at the same locations. Within our proposed hierarchical 
recurrent neural networks, HSRN and HLSTM are introduced as 
two specific instances, which correspond to a fast recurrent 
model and a more sophisticated but more effective recurrent 
model respectively. By integrating CNNs and HRNNs, our 
C-HRNNs show outstanding performance on image classification. 

Fig. 5: C-HLSTM vs SPP-net, results on SUN 397. The two rows above the dashed line: images misclassified by SPP-net, but 
correctly classified by C-HLSTM. The row below the dashed line: images correctly classified by SPP-net, but misclassified 
by C-HLSTM. Under each image, the first row shows the label predicted by C-HLSTM, and the second row shows the 
label predicted by SPP-net; the prediction confidence scores are shown in brackets, and correct labels are in bold. 